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Abstract. We present a Bayesian non-negative tensor factorization model for 
count-valued tensor data, and develop scalable inference algorithms (both batch 
and online) for dealing with massive tensors. Our generative model can handle 
overdispersed counts as well as infer the rank of the decomposition. Moreover, 
leveraging a reparameterization of the Poisson distribution as a multinomial fa¬ 
cilitates conjugacy in the model and enables simple and efficient Gibbs sampling 
and variational Bayes (VB) inference updates, with a computational cost that only 
depends on the number of nonzeros in the tensor. The model also provides a nice 
interpretability for the factors; in our model, each factor corresponds to a “topic”. 

We develop a set of online inference algorithms that allow further scaling up the 
model to massive tensors, for which batch inference methods may be infeasible. 

We apply our framework on diverse real-world applications, such as multiway 
topic modeling on a scientific publications database, analyzing a political science 
data set, and analyzing a massive household transactions data set. 

Keywords: Tensor factorization, Bayesian learning, latent factor models, count 
data, online Bayesian inference 

1 Introduction 

Discovering interpretable latent structures in complex multiway (tensor) data is an im¬ 
portant problem when learning from polyadic relationships among multiple sets of ob¬ 
jects. Tensor factorization ma offers a promising way of extracting such latent struc¬ 
tures. The inferred factors can be used to analyze objects in each mode of the tensor 
(e.g., via classification or clustering using the factors), or to do tensor completion. 

Of particular interest, in the context of such data, are sparsely-observed count¬ 
valued tensors. Tensors are routinely encountered in many applications. For example, 
in analyzing a database of scientific publications, the data may be in form of a sparse 
four-way count-valued tensor (authors x words x journals x years). Another applica¬ 
tion where multiway count data is routinely encountered is the analysis of contingency 
tables ED which represent the co-occurrence statistics of multiple sets of objects. 

We present a scalable Bayesian model for analyzing such sparsely-observed tensor 
data. Our framework is based on a beta-negative binomial construction, which provides 
a principled generative model for tensors with sparse and potentially overdispersed 
count data, and produces a non-negative tensor factorization. In addition to perform¬ 
ing non-negative tensor factorization and tensor completion for count-valued tensors, 
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our model has the property that each latent factor inferred for a tensor mode also repre¬ 
sents a distribution (or “topic”, as in topic models) over the objects of that tensor mode; 
our model naturally accomplishes this by placing a Dirichlet prior over the columns 
of the factor matrix of each tensor mode. In addition to providing an expressive and 
interpretable model for analyzing sparse count-valued tensors, the model automatically 
infers the rank of the decomposition, which side-steps the crucial issue of pre-specifying 
the rank of the decomposition II141181221 . 

Our framework also consists of a set of batch and scalable online inference methods. 
Using a reparameterization of the Poisson distribution as a multinomial allows us to 
achieve conjugacy, which facilitates closed-form Gibbs sampling as well as variational 
Bayes (VB) inference. Moreover, we also develop two online inference algorithms - one 
based on online MCMC f7] and the other based on stochastic variational inference |9|- 
These inference algorithms enable scaling up the model to massive-sized tensor data. 

One of the motivations behind our work is analyzing massive multiway data for 
tasks such as understanding thematic structures in scholarly databases (e.g., to de¬ 
sign better recommender systems for scholars), understanding consumer behavior from 
shopping patterns of large demographies (e.g., to design better marketing and supply 
strategies), and understanding international relations in political science studies. In our 
experiments, we provide qualitative analyses for such applications on large-scale real- 
world data sets, and the scalability behavior of our model. 

2 Canonical PARAFAC Decomposition 

Given a tensor y of size n\ x ri 2 x • • • x n k, with rife denoting the size of y along 
the k th mode (or “way”) of the tensor, the goal in a Canonical PARAFAC (CP) decom¬ 
position nn is to decompose y into a set of K factor matrices ,..., U'- K 1 where 
..., tt^], k = {1,..., K}, denotes the rife x R factor matrix associ¬ 
ated with mode k. In its most general form, CP decomposition expresses the tensor y 
via a weighted sum of R rank-1 tensors as y ~ ^ Q ... (3 ui K ' > ). The 

form of / depends on the type of data being modeled (e.g., / can be Gaussian for real¬ 
valued, Bernoulli-logistic for binary-valued, Poisson for count-valued tensors). Here X r 
is the weight associated with the r th rank-1 component, the rife x 1 column vector u'r ' 1 
represents the r th latent factor of mode k, and 0 denotes vector outer product. 

3 Beta-Negative Binomial CP Decomposition 

We focus on modeling count-valued tensor data B) and assume the following generative 
model for the tensor y 


R 


y - 

- Poiscy A r .u^ © ... 0 u 

(1) 

, 

-Dir (a (fc) ,...,a (fe) ) 

(2) 

Ap n 

- Gamma(sy, ^ ) 

(3) 

Pr - 

- Beta(ce, c(l — e)) 

(4) 
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We use subscript i = {U,..., ix} to denote the index of the i-th entry in y. Using 
this notation, the i-th entry of the tensor can be written as yi ~ Pois(^^ =1 X r II^=i u itr) 
We assume that we are given N observations {yi}^ =1 from the tensor y. 

Since the gamma-Poisson mixture distribution is equivalent to a negative binomial 
distribution MT31l , ([TJ and Q, coupled with the beta prior (Eq[4]i on p r , lead to what we 
will call the beta-negative binomial CP (BNBCP) decomposition model. A few things 
worth noting about our model are 


- The Dirichlet prior on the factors u, k) naturally imposes non-negativity constraints El 

on the factor matrices U^,..., Moreover, since each column ui k> of these 

(k) 

factor matrices sums to 1, u r can also be thought of a distribution (e.g., a “topic”) 
over the entities in mode k. 

- The gamma-beta hierarchical construction of A r (Eq[3]and [4]) allows inferring the 
rank of the tensor by setting an upper bound R on the number of factors and letting 
the inference procedure infer the appropriate number of factors by shrinking the 
coefficients A r ’s to close to zero for the irrelevant factors. 

- The resulting negative binomial model is useful for modeling overdispersed count 
data in cases where the Poisson likelihood may not be suitable. 

- Using alternate parameterizations (Section 3.1 1 of the Poisson distribution in (JTJ 
leads to a fully conjugate model and facilitates efficient Gibbs sampling and varia¬ 
tional Bayes (VB) inference, in both batch as well as online settings. 


3.1 Reparametrizing the Poisson Distribution 

The generative model described in Eq is not conjugate. We now describe two 

equivalent parametrizations 161241 of (|T|). which transform 0-0 into a fully conjugate 
model and facilitate easy-to-derive and scalable inference procedures. These parame¬ 
terizations are based on a data augmentation scheme described below. 

The first parametrization expresses the i-th count-valued entry iji of the tensor y as 
a sum of R latent counts {yir^-i K 

Vi y ir ’ ^ ~ Pois (Ar 0 «-fcr) ( 5 ) 

r =1 k—1 

The second parametrization assumes the vector {yi r }i Li of latent counts is drawn 
from a multinomial as 


Hill ■ ■ ■ ; ViR. ^ Mtllt(y2; Cili ■ ■ • > C iR.) 


C ir — 


rifc=l (| 


R \ T~\ K (k) 

X/r=l ^r n*:=l U i k r 


( 6 ) 


The above parameterizations follows from the following lemma I6l24ft : 


Lemma 1. Suppose that Xi,... ,Xr are independent random variables with x r ~ 
Pois(9 r ) and x = x r- Set 6 = Y^= i l et ( z i z ii ■ ■ ■ > Z R ) be another set of 

random variables such that z ~ Pois{6), and (zi,..., Zr)\z ~ Mult(z ; ..., ^p). 

Then the distribution of x = (x,x±,... ,Xjf) is the same as the distribution of z = 
(z,zi,...,z R ). 
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(k) 

These parameterizations, along with the fact that the columns u r of each factor 
matrix are drawn from a Dirichlet, allows us to leverage the Dirichlet-multinomial con- 
jugacy and derive simple Gibbs sampling and variational Bayes (VB) inference update 
equations, as described in Section[4] 

4 Inference 

We first present the update equations for batch Gibbs sampling (Section pTT) and batch 
VB inference (Section [4~2| ). We then present two online inference algorithms, based on: 
(?) conditional density filtering J7j, which provides an efficient way to perform online 
MCMC sampling using conditional sufficient statistics of the model parameters; and 
(ii) stochastic variational inference ED, which will allow scaling up VB inference by 
processing data in small minibatches. 

We also define two quantities = Y2ii k =j Vir and s r = yi^ r which denote 
aggregates (sufficient statistics) computed using the latent counts iji r . These quantities 
appear at various places in the description of the inference algorithms we develop. 


4.1 Gibbs Sampling 

- Sampling yi r : The latent counts {j/i,-}^Li are sampled from a multinomial (JbJ. 

(k) 

- Sampling u r : Due to the Dirichlet-multinomial conjugacy, the columns of each 
factor matrix have Dirichlet posterior and are sampled as 


,(fc) 


Dir(a 


( fc ) , e ( fc ) „(*0 


5 l,ri ' 


(fc) 

s 2 ri ■ 


,a 


(k) I „(fc) 


(7) 


v . . (k) 

- Sampling p r : Using the fact that s r = jTV Vi,r an d marginalizing over the u ikr ’s 
in (|5|, we have s r ~ Pois(A r ). Using this, along with |[3}, we can express s r using 
a negative binomial distribution, i.e., s r ~ NB( r/ r , p r ). Then, due to the conjugacy 
between negative binomial and beta, we can sample p r as 

p r ~ Beta(ce + s r , c(l — e) + g r ) (8) 

- Sampling \ r : Again using the fact that s r ~ Pois(A r ), and due to the gamma- 
Poisson conjugacy, we have 

X r ~ Gamma(p r + s r ,p r ) (9) 


Computational Complexity: Sampling the latent counts {yi r }r Li for each nonzero 
observation iji (note that for iji = 0, the latent counts are trivially zero) requires 
computing {Cirl^Li, an d computing each Q r requires O(K) time (Eq [6j». Therefore, 
sampling all the latent counts {yi r }l Li requires O(NRK) time. Sampling the latent 
factors {u { r k) }? =1 for the K tensor modes requires O(RK) time. Sampling {p r }^ = i 
and | A r requires O(Ji) time each. Of all these steps, sampling the latent counts 
{Vir}i Li (which are also used to compute the sufficient statistics s-'' k and s r ) is the 
most dominant step, leading to an overall time-complexity of O(NRK) for the Gibbs 
sampling procedure. 
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The linear dependence on N (number of nonzeros) is especially appealing because 
most real-world count-valued tensors are extremely sparse (have much less than even 
1% nonzeros. In contrast to the standard negative-binomial models for count data, 
for which the inference complexity also depends on the zeros whose number may 
be massive (and therefore heuristics, such as subsampling the zeros, are needed), the 
reparametrizations (Section |3. 1 [ ) used by our model allow us to ignore the zeros in the 
multinomial sampling step (the sufficient statistics do not depend on the zero entries in 
the tensor), thereby significantly speeding up the inference. 

4.2 Variational Bayes Inference 

Using the mean-field assumption [12], we approximate the target posterior distribu¬ 
tion by Q = Ui,r QiVir) IL.r l( u r k) ) Ilr '/(•V) IL <l(Pr)- Our fully conjugate model 
enables closed-form variational Bayes (VB) inference updates, with the distribution 
qijjir ), q{u[ k ^), q{ A r ), and q(p r ) being multinomial, Dirichlet, beta, and gamma, re¬ 
spectively. We summarize the update equations for the variational parameters of each 
of these distributions, below: 

- Updating jji r : Using ||6j, the updates for yi r are given by £[//,>] = Q r where Q r 

is defined as Q /r = - and Q r can be computed as 

2^r= 1 Cir 

K K 

Cir = exp{<^(s r + 5r .) + ln(p r ) + ^ +a (fc) )+ a(fc) )]} ( 10 ) 

k —1 k—1 

where , I / (.) is the digamma function, which is the first derivative of the logarithm 
of the gamma function. 

- Updating u\ r : The mean-field posterior q{u r ) is Dirichlet with each ol the com- 

(fc) 

ponent means given by E[it^] = P ' kr {k) where p^ r = a + s[ k \ r . 

- Updating p r : The mean-field posterior q{p r ) is beta with mean given by Ejp r ] = 
P Z+Prb where Vra = ce + s r , p rb = c(l - e) + fly. 

- Updating A r : The mean-field posterior q(X r ) is gamma with mean given by E[A r ] = 
AraA r b, where A ra = ( g r + s r ) and X rb = p r . 

A note on Gibbs vs VB: The per-iteration time-complexity of the VB inference 
procedure is also O(NRK). It is to be noted however that, in practice, one iteration of 
VB in this model is a bit more expensive than one iteration of Gibbs, due to the digamma 
function evaluation for the (i r which is needed in VB when updating the yir s. Prior 
works on Bayesian inference for topic models J8l also support this observation. 

4.3 Online Inference 

Batch Gibbs (Section |4~T| ) and VB (Section [4~2| inference algorithms are simple to im¬ 
plement and efficient to run on moderately large-sized problems. These algorithms can 
however be slow to run for massive data sets (e.g., where the number of tensor en¬ 
tries N and/or the dimension of the tensor is massive). The Gibbs sampler may exhibit 





6 


Hu, Rai, Chen, Harding, Carin 


slow mixing and the batch VB may be slow to converge. To handle such massive ten¬ 
sor data, we develop two online inference algorithms. The first is online MCMC based 
conditional density filtering {7[, while the second is based on stochastic variational in¬ 
ference © . Both these inference algorithms allow processing data in small minibatches 
and enable our model to analyze massive and/or streaming tensor data. 

Conditional Density Filtering The conditional density filtering (CDF) algorithm |7) 
for our model selects a minibatch of tensor entries at each iteration, samples the latent 
counts for these entries conditiond on the previous estimates of the model 

parameters, updates the sufficient statistics s.- r ' and s r using these latent counts (as 
described below), and resamples the model parameters conditioned on these sufficient 
statistics. Denoting I t as data indices in minibatch at round t, the algorithm proceeds as 

- Sampling yi r : For all i £ I t , sample the latent counts y ir( - ieh) usin 8 

- Updating the conditional sufficient statistics: Using data from the current mini¬ 
batch, update the conditional sufficient statistics as: 


<1-7^ 

i£lt'-ik=3 

(ID 

(1 -7t)4 t_1) +r yt^ 

(12) 


ieit 


Note that the updated conditional sufficient statistics (CSS), indexed by superscript 
f, is a weighted average of the old CSS, indexed by superscript t — 1, and of that 
computing only using the current minibatch (of size B ). In addition, the latter term 
is further weighted by N/B so as to represent the average CSS over the entire data. 
In the above, 7 t is defined as 7 t = (to + t)~ K , 1 0 > 0, and k £ (0.5,1] is needed to 
guarantee convergence 0. 

(k) 

- Updating u y r , p r . A r : Using the updated CSS, draw M samples for each of the 


model parameters {u ! 'r ' m> 

, pi'" 1 , Ar '" 1 from the following conditionals: 


4 fc) ~ 

Dir (a( fc ) + S J i) ,...,« (t) +^>) 

(13) 

Pr ~ 

Beta(ce + s®, c(l — e) + g r ) 

(14) 


A r ~ Gamma(g r + s^\p r ) (15) 

(k) 

and either store the sample averages of u r . p r , and A r , or their analytic means to 
use for the next CDF iteration 0 . Since the analytic means of the model parameters 
are available in closed-form in this case, we use the latter option, which obviates 
the need to draw M samples, thereby also speeding up the inference significantly. 

We next describe the stochastic (online) VB inference for our model. 

Stochastic Variational Inference The batch VB inference (Section [4~2| requires using 
the entire data for the parameter updates in each iteration, which can be computation¬ 
ally expensive and can also result in slow convergence. Stochastic variational inference 
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(SVI), on the other hand, leverages ideas from stochastic optimization a and, in each 
iteration, uses a small randomly chosen minibatch of the data to updates the parameters. 
Data from the current minibatch is used to compute stochastic gradients of the varia¬ 
tional objective w.r.t. each of the parameters and these gradients are subsequently used 
in the parameter updates. For our model, the stochastic gradients depend on the suffi¬ 
cient statistics computed using the current minibatch I t : s'- f ' = J2iei t -i k =j Vir and 
„(*) _ 
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Denoting B as the minibatch 


Sr ' = 2-*iiei t Vi,r’ where yi r is computed using Eq 
size, we reweight these statistics by N/B to compute the average sufficient statistics 
over the entire data 0 and update the other variational parameters as follows: 


pff = (1 - lt)pf k r~ l) + 7t(a (fc) + (N/B)s[ k J) (16) 

Prl = (1 - 'Yt)p ( ra 1) + 7t(ce + (N/B)s ( r t) ) (17) 

Prb = (! - lt)p ( r b ~ 1) + 7i(c(l - e) + g r ) (18) 

= (1 - + lt{9r + (N/B)sP) (19) 

= (l-7t)A^ 1) +7tPr (20) 


where 7 1 is defined as 7 1 = (£ 0 + t) K , to > 0, and k G (0.5,1] is needed to guarantee 
convergence 0 . 


Computational Complexity: In contrast to the batch Gibbs and batch VB, both of 
which have O(NRK) cost per-iteration, the per-iteration cost of the online inference 
algorithms (CDF and SVI) is 0(\It\RK) where \I t \ is the minibatch size at round t. We 
use a fixed minibatch size B for each minibatch, so the per-iteration cost is O(BRK). 

5 Related Work 

Although tensor factorization methods have received considerable attention recently, 
relatively little work exists on scalable analysis of massive count-valued tensor data. 
Most of the recently proposed methods for scalable tensor decomposition II13121 101171 
are based on minimizing the Frobenious norm of the tensor reconstruction error, which 
may not be suitable for count or overdispersed count data. The rank of decomposition 
also needs to be pre-specified, or chosen via cross-validation. Moreover, these methods 
assume the tensor to be fully observed and thus cannot be used for tensor completion 
tasks. Another key difference between these methods and ours is that scaling up these 
methods requires parallel or distributed computing infrastructure, whereas our fully 
Bayesian method exhibits excellent scalability on a single machine. At the same time, 
the simplicity of the inference update equations would allow our model to be easily 
parallelized or distributed. We leave this possibility to future work. 

One of the first attempts to explicitly handle count data in the context of non¬ 
negative tensor factorization includes the work of j4j, which is now part of the Tensor 
Toolbox^ This method optimizes the Poisson likelihood, using an alternating Poisson 
regression sub-routine, with non-negative constraints on the factor matrices. However, 
this method requires the rank of the decomposition to be specified, and cannot handle 

1 http://www.sandia.gov/~tgkolda/TensorToolbox/index-2.6.html 
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missing data. Due to its inability in handling missing data, for our experiments (Sec¬ 
tion [6]), as a baseline, we implement and use a Bayesian version of this model which 
can handle missing data. 

Among other works of tensor factorization for count data, the method in m can deal 
with missing values, though the rank still needs to be specified, and moreover the factor 
matrices are assumed to be real-valued, which makes it unsuitable for interpretability 
of the inferred factor matrices. 

In addition to the Poisson non-negative tensor factorization method of 0, some 
other non-negative tensor factorization methods tf2 1 15l20 1 also provide interpretability 
for the factor matrices. However, these methods usually have one or more of the fol¬ 
lowing limitations: (1) there is no explicit generative model for the count data, (2) the 
rank needs to be specified, and (3) the methods do not scale to the massive tensor data 
sets of scales considered in this work. 

Methods that facilitate a full Bayesian analysis for massive count-valued tensors, 
which are becoming increasingly prevalent nowadays, are even fewer. A recent attempt 
on Bayesian analysis of count data using Poisson likelihood is considered in lfl9ll : how¬ 
ever, unlike our model, their method cannot infer the rank and relies on batch VB infer¬ 
ence, limiting its scaling behavior. Moreover, the Poisson likelihood may not be suitable 
for overdispersed counts. 

Finally, inferring the rank of the tensor, which is NP-complete in general m, is 
another problem for which relatively little work exists. Recent attempts at inferring the 
rank of the tensor in the context of CP decomposition include Ii"8l22l : however (1) 
these methods are not applicable for count data, and (2) the inferred factor matrices are 
real-valued, lacking the type of interpretability needed in many applications. 

Our framework is similar in spirit to the matrix factorization setting proposed in 124) 
which turns out to be a special case of our framework. In addition, while ll24l only 
developed (batch) Gibbs sampling based inference, we present both Gibbs sampling as 
well as variational Bayesian inference, and design efficient online Bayesian inference 
methods to scale up our framework for handling massive real-world tensor data. 

To summarize, in contrast to the existing methods for analyzing tensors, our fully 
Bayesian framework, based on a proper generative model, provides a flexible method 
for analyzing massive count-valued tensors, side-stepping crucial issues such as rank- 
specification, providing good interpretability of the latent factors, while still being scal¬ 
able for analyzing massive real-world tensors via online Bayesian inference. 

6 Experiments 

We apply the proposed model on a synthetic and three real-world data sets that range in 
their sizes from moderate to medium to massive. The real-world tensor data sets we use 
in our experiments are from diverse application domains, such as analyzing country- 
country interaction data in political science, topic modeling on multiway publications 
data (with entities being authors, words, and publication venues), and analysis of mas¬ 
sive household transactions data. These data sets include: 

- Synthetic Data: This is a tensor of size 300 x 300 x 300 generated using our model 

by setting an upper bound R = 50 over the number of factors; only 20 factors were 

significant (based on the values of A r ), resulting in an effective rank 20. 
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- Political Science Data (GDELT): This is a real-world four-way tensor data of 
country-country interactions. The data consists of 220 countries, 20 action types, 
and the interactions date back to 1979 GS). We focus on a subset of this data col¬ 
lected during the year 2011, resulting in a tensor of size 220 x 220 x 20 x 52. 
Section [tT4| provides further details. 

- Publications Data: This is a 2425 x 9088 x 4068 count-valued tensor, constructed 
from a database of research papers published by researchers at Duke University^ 
the three tensor modes correspond to authors, words, and venues. Section [63] pro¬ 
vides further details. 

- Transactions (Food) Data: This is a 117054 x 438 x 67095 count-valued tensor, 
constructed from a database of transactions data of food item purchases at various 
stores in the US FI the three tensor modes correspond to households, stores, and 
items. Section [63] provides further details. 

We compared our model with the following baselines: (i) Bayesian Poisson Tensor 
Factorization (BayesPTF), which is fully Bayesian version of the Poisson Tensor Fac¬ 
torization model proposed in 0, and (ii) Non-negative Tensor Decomposition based 
on Low-rank Approximation (LRANTD) proposed in l23l . All experiments are done on 
a standard desktop computer with Intel i7 3.4GHz processor and 24GB RAM. 

6.1 Inferring the Rank 

To begin with, as a sanity check for our model, we first perform an experiment on the 
synthetic data described above to see how well the model can recover the true rank 
(tensor completion results are presented separately in Section [63) . For this experiment, 
we run the batch Gibbs sampler (the other inference methods also yield similar results) 
with 1000 burn-ins, and 1000 collection samples. We experiment with three settings: 
using 20%, 50% and 80% data for training. The empirical distribution (estimated using 
the collected MCMC samples) of the effective inferred rank for each of these settings 
is shown in Figure [T] (left). In each collection iteration, the effective rank is computed 
after a simple thresholding on the A r ’s where components with very small X r are not 
counted (also see Figure [T] (right)). With 80% training data, the distribution shows a 
distinct peak at 20 and even with smaller amounts of training data (20% and 50%), the 
inferred rank is fairly close to the ground truth of 20. In Figure [I] (right), we show the 
spectrum of all the A r ’s comparing the ground truth vs the inferred values; 

6.2 Tensor Completion Results 

We next experiment on the task of tensor completion, where for each method 95% of 
the data are used for training and the remaining 5% data is used as the heldout set (note 
that the data sets we use are extremely sparse in nature, with considerably less than 
1% entries of the tensor being actually observed). The results are reported in Table [T] 

2 Obtained from https : //scholars . duke . edu/ 

3 Data provided by United States Department of Agriculture (USDA) under a Third Party Agree¬ 
ment with Information Resources, Inc. (IRI) 
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Fig. 1. Distribution over inferred ranks for syntheric data (left), and A inferred using 80% training 
data (right). 


where we show the log likelihood and the mean-absolute error (MAE) in predicting the 
heldout data. Timing-comparison for the various batch and online inference methods is 


presented separately in Section 6.6 


For this experiment, we compare our BNBCP model (using the various inference 
methods) with (1) BayesPTF - a fully Bayesian variant (we implented it ourselves) of 
a state-of-the-art Poisson Tensor Factorization model originally proposed in J4J (which 
cannot however handle missing data), and (2) LRANTD l23l which is an optimiza¬ 
tion based non-negative tensor decomposition method. As Table |T] shows, our meth¬ 
ods achieve better log-likelihood and MAE as compared to these baselines. Moreover, 
among our batch and online Bayesian inference methods, the online inference methods 
give competitive or better results as compared to their batch counterparts. In particu¬ 
lar, the online MCMC method based on conditional density filtering (BNBCP-CDF) 
works the best across all the methods (please see Section |6.6|for a timing comparison). 


Datasets 

Toy data 

GDELT 

Publication 

Food 

Toy data GDELT Publication Food 

BayesPTF 

LRANTD 

-107563 

N/A 

-4425695 

N/A 

-860808 

N/A 

-2425433 

N/A 

1.012 

1.019 

55.478 

65.049 

1.636 

N/A 

1.468 

N/A 

BNBCP-GlBBS 

BNBCP-VB 

-97580 

-99381 

-3079883 

-2971769 

-619258 

-632224 

-2512112 

-2533086 

0.989 

0.993 

45.436 

43.485 

1.565 

1.574 

1.459 

1.472 

BNBCP-CDF 

BNBCP-ONLINEVB 

-95472 

-98446 

-2947309 

-3169335 

-597817 

-660068 

-2403094 

-2518996 

0.985 

0.989 

44.243 

46.188 

1.555 

1.601 

1.423 

1.461 


Table 1. Loglikelihood and MAE comparison for different methods (the two baselines, our model 
with batch inference, and our model with online inference) on four datasets. Note: LRANTD gave 
out-of-memory error on publications and food transactions data sets so we are unable to report its 
results on these data sets. We also only report the MAE for LRANTD, and not the log-likelihood, 
because it uses a Gaussian likelihood model for the data. 


6.3 Analyzing Publications Database 

The next experiment is on a three-way tensor constructed from a scientific publications 
database. The data consist of abstracts from papers published by various researchers 
at Duke University [j] In addition to the paper abstract, the venue information for each 
paper is also available. The data collection contains 2425 authors, 9088 words (after 
removing stop-words), and 4068 venues which results in a 2425 x 9088 x 4068 word- 
counts tensor, on which we run our model. As the output of the tensor decomposition, 
we get three factor matrices. Since the latent factors in our model are non-negative 
and sum to one, each latent factor can also be interpreted as a distribution over au¬ 
thors/words/venues, and consequently represents a “topic”. Therefore the three factor 

4 Data crawled from https : / /scholars . duke . edu/ 
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matrices inferred by our model for this data correspond to authors x topics, words x 
topics, and venue x topics, which we use to further analyze the data. 

We apply the model BNBCP-CDF on this data (with R = 200) and using the 
inferred words x topics matrix, in Table [2] (left) we show the list of 10 most proba¬ 
ble words in four factors/topics that seem to correspond to optics, genomics, machine 
learning & signal processing, and statistics. To show the topic representation across 
different departments, we present a histogram of departmental affiliations for 20 au¬ 
thors with highest probabilities in these four factors. We find that, for the genomics 
factor, the top authors (based on their topic scores) have affiliations related to biology 
which makes intuitive sense. Likewise, for the statistics factor, most of the top authors 
are from statistics and biostatistics departments. The top 20 authors in factors that cor¬ 
respond to optics and machine learning & signal processing, on the other hand, are from 
departments of electrical and computer engineering and/or computer science, etc. 


Optics 

Genomics 

ML/SP 

Stats 

Top Venues in ML/SP 

GIGAPIXEL 

GENE 

DICTIONARY 

MODEL 

ICASSP 

MICROCAMERA 

CHROMATIN 

SPARSITY 

PRIORS 

IEEE TRANS. SIG. PROC. 

CAMERAS 

OCCUPANCY 

MODEL 

BAYESIAN 

ICML 

APERTURE 

CENTROMERE 

BAYESIAN 

LASSO 

Siam J. img. sci. 

LENS 

TRANSCRIPTION 

COMPRESSED 

LATENT 

IEEE TRANS. IMG. PROC. 

MULTISCALE 

GENOME 

COMPRESSIVE 

INFERENCE 

IEEE INT. SYMP. BIOMED. IMG. 

OPTICAL 

SITES 

MATRIX 

REGRESSION 

NIPS 

SYSTEM 

EXPRESSION 

DENOISING 

SAMPLER 

IEEE TRANS. WIRELESS COMM. 

NANOPROBES 

SEQUENCE 

GIBBS 

SEMIPARAMETRIC 

IEEE WORKSHOP STAT. SIG. PROC. 

METAMATERIAL 

VEGFA 

NOISE 

NONPARAMETRIC 

IEEE TRANS. INF. THEORY 


Table 2. Most probable words in topics related to optics, genomics, machine learning/signal 
processing(ML/SP) and statistics (Stats), and top ranked venues in ML/SP community. 



Fig. 2. Histogram of affiliations for top 20 authors in factors related to machine learning/signal 
processing (top left) and statistics (top right), optics (bottom left), and genomics(bottom right) 

Similarly, using the inferred venues x topics matrix, we list the most likely venues 
for each topic. Due to space-limitations, here we only present the most likely venues in 
machine learning & signal processing factor/topic; the result is shown in Table [2] (right¬ 
most column). The result shows that venues like ICASSP, IEEE Trans. Signal Proc., 
ICML, and NIPS all rank at the top in the machine learning & signal processing factor, 
which again makes intuitive sense. 
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6.4 Analyzing Political Science Data 

We use the model to analyze a real-world political science data set consisting of country- 
country interactions. Such analyses are typically done by political scientists to study, 
analyze and understand complex international multilateral relations among countries. 
The data set is from the Global Database of Events, Location, and Tone (GDELT) na. 
GDELT records the dyadic interactions between countries in the form of “Country A 
did something to Country B”. In our experiments, we consider 220 countries (“ac¬ 
tors”) and 20 unique high-level action types in 52 weeks of year 2012. After prepro¬ 
cessing, we have a four-way (country-country-action-time) action counts tensor of size 
220 x 220 x 20 x 52. Note that both first and second tensor mode represents countries; 
first mode as “sender” and the second mode as “receiver” of a particular action. In this 
analysis, we set R to be large enough (200) and the model discovered roughly about 
120 active components (i.e., components with significant value of A r ). 


Julian Assange Asylum in Ecuador 
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Fig. 3. Country factors (top row) and time factors (bottom row) for Julian Assange asylum in 
Ecuador (left column) and 2012 Benghazi attack (right column). 

We apply the model (BNBCP-CDF; other methods yield similar results) and ex¬ 
amine each of the time dimension factors, specifically looking for the significant com¬ 
ponents (based on the magnitude of A r ) in which the time dimension factor also peaks 
during certain time(s) of the year. We show results with two such factors in Figure [3] 
In Figure [3] (column 1), the time and country (actor) factors seems to suggest that this 
factor/topic corresponds to the event “Julian Assange”. The actor subplot shows spikes 
at Ecuador, United Kingdom, United States, and Sweden whereas the time factor in the 
bottom left subplot shows spikes between June and August. The time and countries in¬ 
volved are consistent with the public knowledge of the event of Julian Assange seeking 
refuge in Ecuador. 

Likewise, in Figure [3] (column 2), the time and country (actor) factors seems to 
suggest that this factor corresponds to the event “Benghazi Attack” which took place on 
Sept. 12 (week 37) of 2012, in which Islamic militants attacked American diplomatic 
compound in Benghazi, Libya. The attack killed an US Ambassador. As the Figure 
shows, the top actors identified are US, Libya and Egypt, and spikes are found at around 
week 37 and 38, which are consistent with the public knowledge of this event. 

The results of these analyses demonstrate that the interpretability of our model can 
be useful for identifying events or topics in such multiway interaction data. 
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6.5 Analyzing Transactions Data 

We next apply our model (BNBCP-CDF; other methods yield similar results) for an¬ 
alyzing transactions data for food item purchases made at stores. Our data is collected 
for a demographically representative sample of US consumers who reside in large ur¬ 
ban and suburban areas and purchase food in supermarkets and grocery stores. The data 
were provided by the USDA under a Third Party Agreement with IRI. Each transaction 
is identified by a unique Universal Product Code (UPC) barcode and the store where 
the transaction occurred. Some items such as fresh produce do not have UPCs and are 
identified separately. The households are observed over a four year period, during which 
they are provided with a technology that allows them to scan each purchase and record 
additional information such as the store where the purchase was made (and other eco¬ 
nomic data). Participating households are provided with incentives designed to encour¬ 
age compliance. For each household-product-store combination we record the num¬ 
ber of unique purchases over the sampling period. The database has a total of 117,054 
unique households, 438 stores, and 67,095 unique items and we construct a 3-way count 
tensor of size 117054 x 438 x 67095 with about 6.2 million nonzero entries. 

We apply the proposed model on this data by setting R = 100 (out of which about 
60 components were inferred to have a significant value of A r ) and looked at the stores 
factor matrix. Since each column (which sums to 1) of the store factor matrix can be 
thought of as a distribution over the stores, we look at three of the factors from the store 
factor matrix and tried to identify the stores that rank at the top in that factor. In Table [3] 
we show results from each of these factors. Factor 1 seems to suggest that it is about the 
most popular stores (included Walmart, for example). Factor 2 has stores that primarily 
deal in wholesale (e.g., Costco, Sam’s Wholesale Club), and Factor 3 contains stores 
that sell none or very few food items (e.g., Mobil, Petco). Note that the Walmart Super 
Center figures prominently in both Factor 1 and Factor 2. 


Factor 1 

Factor 2 

Factor 3 

Walmart Sup. Center Sam’s Club 

Dick’s Sporting 

Walmart Traders 

Meijer 

Mobil 

Walmart Neighb. 

Costco 

Petco 

Walmart 

B J’S Wholesale 

Sally Beauty 

Kroger 

Walmart Sup. Center GNC All 


Table 3. Three of the store factors inferred from the transaction data (top-5 stores shown for each) 

We next look at the items factor matrix. In Figure[2] we plot the inferred distribution 
over items in each of the three clusters described above. For factors 1 and 2 (which cor¬ 
respond to the most popular stores and wholesale stores respectively), the distribution 
over the items (top and bottom panel in Figure |2j have a reasonably significant mass 
over a certain range of items (for the items indexed towards the left side in the plots of 
factors 1 and 2). On the other hand, for factor 3 which corresponds to stores that sell no 
or very few types of food items, the distribution over the items is rather flat and diffuse 
with very weak intensities (looking at the scale on the y axis). From the Figure [2] it is 
also interesting to observe that the set of active items in factors (1 & 2) vs factor 3 seem 
to be mostly disjoint. 

This analysis provides a first attempt to analyze food shopping patterns for Ameri¬ 
can consumers on a large scale. As the world, at large, struggles with a combination of 
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Fig. 4. Distributions over items for three factors (each factor corresponds to a cluster). 

increasing obesity rates and food insecurity, this analysis shows that consumer prefer¬ 
ences are densely clustered across both stores and items. This indicates that household 
tend to have fairly rigid preferences over the stores where they shop. Furthermore, they 
tend to consume a relatively small number of products from the universe of available 
products. The concentration in both stores and products is indicative of limited search 
behavior and substantial behavioral rigidity which may be associated with suboptimal 
outcomes in terms of nutrition and health. 

6.6 Scalability 

We now perform an experiment comparing the proposed inference methods (batch 
and online) to assess their scalability (Figure [5]). We first use the Transactions data 
(117054 x 438 x 67095) for this experiment. We would like to note that the state-of- 
the-art methods for count-valued tensor, such as the Poisson Tensor Factorization (PTF) 
method from the Tensor Toolbox E), are simply infeasible to run on this data because 
of storage explosion issue (the method requires expensive flattening operations of the 
tensor). The other baseline LRANTD Il23l we used in our experiments was also infea¬ 
sible to run on this data. We set R = 100 for each method (about 60 factors were found 
to be significant, based on the inferred values of the A r ’s) and use a minibatch size 
of 100000 for all the online inference methods. For the conditional density filtering as 
well as stochastic variational inference, we set the learning rate as t 0 = 0 and re = 0.5. 
Figure[5]shows that online inference methods (conditional density filtering and stochas¬ 
tic variational inference) converge much faster to a good solution than batch methods. 
This experiment shows that our online inference methods can be computationally viable 
alternatives if their batch counterparts are slow/infeasible to mn on such data. 


Food Transactions Data 



Fig. 5. Time vs heldout log likelihoods with various methods on transactions data 
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We then perform another experiment on the Scholars data, on which the PTF method 
of g) was feasible to run and compare its per-iteration running time with our model 
(using both batch as well as online inference). Since PTF cannot handle missing data, 
for this experiment, each method was run with all the data. As Fig[6]shows, our methods 
have running times that are considerably smaller than that of PTF. 


7 Conclusion 

We have presented a fully Bayesian 
framework for analyzing massive tensors 
with count data, and have designed a 
suite of scalable inference algorithms for 
handling massive tensor data. In addi¬ 
tion to giving interpretable results and in¬ 
ferring the rank from the data, the pro¬ 
posed model can infer the distribution 
over objects in each of the tensor modes 
which can be useful for understanding 
groups of similar objects, and also for do¬ 
ing other types of qualitative analyses on F 'g- 6 - Timing comparison of various methods 
such data, as shown by our various ex- on Scholars data 
periments on real-world data sets. Sim¬ 
plicity of the inference procedure also makes the proposed model amenable for parallel 
and distributed implementations, e.g., using MapReduce or Hadoop. The model can be 
a useful tool for analyzing data from diverse applications and scalability of the model 
opens door to the application of scalable Bayesian methods for analyzing massive mul¬ 
tiway count data. 
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